fine-grained complexity


On the Fine-Grained Complexity of Empirical Risk Minimization: Kernel Methods and Neural Networks

Backurs, Arturs, Indyk, Piotr, Schmidt, Ludwig

Neural Information Processing Systems

Empirical risk minimization (ERM) is ubiquitous in machine learning and underlies most supervised learning methods. While there is a large body of work on algorithms for various ERM problems, the exact computational complexity of ERM is still not understood. We address this issue for multiple popular ERM problems including kernel SVMs, kernel ridge regression, and training the final layer of a neural network. In particular, we give conditional hardness results for these problems based on complexity-theoretic assumptions such as the Strong Exponential Time Hypothesis. Under these assumptions, we show that there are no algorithms that solve the aforementioned ERM problems to high accuracy in sub-quadratic time. We also give similar hardness results for computing the gradient of the empirical loss, which is the main computational burden in many non-convex learning tasks.
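To see where the quadratic cost enters, here is a minimal NumPy sketch (illustrative, not taken from the paper) of the dense Gaussian kernel matrix that exact kernel SVM and kernel ridge regression solvers must form; all names and parameters are assumptions for the example.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """Dense Gaussian kernel matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)).

    All n^2 pairwise squared distances are materialized, so the cost is
    Theta(n^2 d). This is the quadratic step that the hardness results say
    cannot be avoided for high-accuracy kernel ERM (under SETH).
    """
    sq_norms = np.sum(X ** 2, axis=1)
    # ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 <x_i, x_j>, via broadcasting
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma ** 2))

X = np.random.default_rng(0).normal(size=(100, 5))
K = gaussian_kernel_matrix(X)
print(K.shape)  # (100, 100)
```

The `np.maximum(..., 0.0)` guard clips tiny negative distances caused by floating-point cancellation, so the diagonal stays exactly at kernel value 1.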


The Fine-Grained Complexity of Gradient Computation for Training Large Language Models

Neural Information Processing Systems

Large language models (LLMs) have made fundamental contributions over the last few years. To train an LLM, one alternates between forward computations and backward computations. The forward computation can be viewed as attention function evaluation, and the backward computation can be viewed as gradient computation. Previous work by [Alman and Song, NeurIPS 2023] proved that the forward step can be performed in almost-linear time in certain parameter regimes, but that there is no truly sub-quadratic time algorithm in the remaining parameter regimes unless the popular hypothesis SETH is false. In this work, we show nearly identical results for the harder-seeming problem of computing the gradient of the loss function of a one-layer attention network, and thus for the entire process of LLM training.
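The forward computation the abstract refers to is exact softmax attention; a minimal NumPy sketch (an illustration of the standard operation, not the authors' code) makes the quadratic cost concrete:

```python
import numpy as np

def attention_forward(Q, K, V):
    """Exact softmax attention: Attn(Q, K, V) = softmax(Q K^T / sqrt(d)) V.

    Materializing the n x n score matrix costs Theta(n^2 d) time and
    Theta(n^2) memory. The cited results show that, outside certain
    parameter regimes, this quadratic cost is unavoidable for exact
    forward and backward passes, assuming SETH.
    """
    n, d = Q.shape
    S = Q @ K.T / np.sqrt(d)            # n x n attention scores
    S -= S.max(axis=1, keepdims=True)   # numerically stable softmax
    P = np.exp(S)
    P /= P.sum(axis=1, keepdims=True)   # row-stochastic attention weights
    return P @ V

rng = np.random.default_rng(1)
n, d = 64, 8
Q, K, V = rng.normal(size=(3, n, d))
out = attention_forward(Q, K, V)
print(out.shape)  # (64, 8)
```

The backward pass studied in the paper differentiates a loss through this same `P @ V` map, which is why its complexity landscape mirrors the forward computation's.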


Reviews: On the Fine-Grained Complexity of Empirical Risk Minimization: Kernel Methods and Neural Networks

Neural Information Processing Systems

The paper makes use of (relatively) recent advances in complexity theory to show that many common learning problems do not admit subquadratic-time learning algorithms (assuming the veracity of the "Strong Exponential Time Hypothesis"). I appreciate that the authors do not oversell their results: they clearly state that they provide a worst-case analysis. Also, the results are not surprising. For instance, finding the exact solution of any kernel method requires computing the full kernel matrix, which is already quadratic in the number of training examples. Reducing this computation time would imply that one can compute an approximation of the exact solution without computing the full kernel matrix, which is intuitively unlikely unless one makes extra assumptions on the problem structure (i.e., the nature of the data-generating distribution).


Fundamental Limitations on Subquadratic Alternatives to Transformers

Alman, Josh, Yu, Hantao

arXiv.org Artificial Intelligence

The Transformer architecture is widely deployed in many popular and impactful Large Language Models. At its core is the attention mechanism for calculating correlations between pairs of tokens. Performing an attention computation takes quadratic time in the input size, and has become the time bottleneck for transformer operations. In order to circumvent this, researchers have used a variety of approaches, including designing heuristic algorithms for performing attention computations faster, and proposing alternatives to the attention mechanism which can be computed more quickly. For instance, state space models such as Mamba were designed to replace attention with an almost-linear-time alternative. In this paper, we prove that any such approach cannot perform important tasks that Transformer is able to perform (assuming a popular conjecture from fine-grained complexity theory). We focus on document similarity tasks, where one is given as input many documents and would like to find a pair which is (approximately) the most similar. We prove that Transformer is able to perform this task, and we prove that this task cannot be performed in truly subquadratic time by any algorithm. Thus, any model which can be evaluated in subquadratic time - whether because of subquadratic-time heuristics for attention, faster attention replacements like Mamba, or any other reason - cannot perform this task. In other words, in order to perform tasks that (implicitly or explicitly) involve document similarity, one may as well use Transformer and cannot avoid its quadratic running time.
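The document-similarity task above is solvable by brute force in quadratic time; a minimal sketch (illustrative, using cosine similarity as one concrete notion of "most similar" - the paper's exact formalization may differ) shows the all-pairs scan that the hardness result says cannot be beaten by a truly subquadratic algorithm:

```python
import numpy as np

def most_similar_pair(docs):
    """Brute-force most-similar pair by cosine similarity.

    Checks all n(n-1)/2 pairs, so it runs in Theta(n^2 d) time. Under the
    fine-grained complexity conjecture discussed in the paper, no truly
    subquadratic algorithm solves this task, so no subquadratic-time model
    (heuristic attention, Mamba, etc.) can solve it either.
    """
    X = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    best, best_pair = -np.inf, None
    n = len(X)
    for i in range(n):
        for j in range(i + 1, n):
            s = float(X[i] @ X[j])
            if s > best:
                best, best_pair = s, (i, j)
    return best_pair, best

docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
pair, score = most_similar_pair(docs)
print(pair)  # (0, 2): the two near-duplicate documents
```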


Fine-Grained Complexity and Algorithms for the Schulze Voting Method

Sornat, Krzysztof, Williams, Virginia Vassilevska, Xu, Yinzhan

arXiv.org Artificial Intelligence

We study computational aspects of a well-known single-winner voting rule called the Schulze method [Schulze, 2003], which is used broadly in practice. In this method the voters give (weak) ordinal preference ballots which are used to define the weighted majority graph (WMG) of direct comparisons between pairs of candidates. The choice of the winner comes from indirect comparisons in the graph, and more specifically from considering directed paths instead of direct comparisons between candidates. When the input is the WMG, to our knowledge, the fastest algorithm for computing all possible winners in the Schulze method uses a folklore reduction to the All-Pairs Bottleneck Paths (APBP) problem and runs in $O(m^{2.69})$ time, where $m$ is the number of candidates. It is an interesting open question whether this can be improved. Our first result is a combinatorial algorithm with a nearly quadratic running time for computing all possible winners. If the input to the possible winners problem is not the WMG but the preference profile, then constructing the WMG is a bottleneck that increases the running time significantly; in the special case when there are $O(m)$ voters and candidates, the running time becomes $O(m^{2.69})$, or $O(m^{2.5})$ if there is a nearly-linear time algorithm for multiplying dense square matrices. To address this bottleneck, we prove a formal equivalence between the well-studied Dominance Product problem and the problem of computing the WMG. We prove a similar connection between the so-called Dominating Pairs problem and the problem of verifying whether a given candidate is a possible winner. Our paper is the first to bring fine-grained complexity into the field of computational social choice. Using it we can identify voting protocols that are unlikely to be practical for large numbers of candidates and/or voters, as their complexity is likely at least cubic.
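For concreteness, here is the textbook cubic-time baseline for Schulze winners from a WMG (a Floyd-Warshall-style widest-path computation, which is the standard published algorithm - not the nearly-quadratic combinatorial algorithm this paper develops; the example ballot counts are made up):

```python
def schulze_winners(wmg):
    """All Schulze winners given a weighted majority graph.

    wmg[a][b] = number of voters preferring candidate a to candidate b.
    Strongest-path strengths p[a][b] (widest paths, where a path's strength
    is its weakest link) are computed by a Floyd-Warshall-style DP in
    O(m^3) time. Candidate a is a winner if p[a][b] >= p[b][a] for all b.
    """
    m = len(wmg)
    # Only pairwise victories seed the path strengths.
    p = [[wmg[a][b] if wmg[a][b] > wmg[b][a] else 0 for b in range(m)]
         for a in range(m)]
    for k in range(m):                     # intermediate candidate
        for a in range(m):
            for b in range(m):
                if a != b and a != k and b != k:
                    p[a][b] = max(p[a][b], min(p[a][k], p[k][b]))
    return [a for a in range(m)
            if all(p[a][b] >= p[b][a] for b in range(m) if b != a)]

# 10 voters, 3 candidates in a Condorcet cycle: 0 beats 1, 1 beats 2, 2 beats 0.
wmg = [[0, 6, 4], [4, 0, 7], [6, 3, 0]]
print(schulze_winners(wmg))  # [0, 1]
```

The $O(m^{2.69})$ algorithm mentioned in the abstract replaces this cubic DP with a reduction to All-Pairs Bottleneck Paths solved via fast matrix multiplication.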



Finding the true potential of algorithms

#artificialintelligence

Each semester, Associate Professor Virginia Vassilevska Williams tries to impart one fundamental lesson to her computer-science undergraduates: Math is the foundation of everything. Often, students come into Williams' class, 6.006 (Introduction to Algorithms), wanting to dive into the advanced programming that powers the latest, greatest computing techniques. Her lessons instead focus on how algorithms are designed around core mathematical models and concepts. "When taking an algorithms class, many students expect to program a lot and perhaps use deep learning, but it's very mathematical and has very little programming," says Williams, the Steven G. (1968) and Renee Finn Career Development Professor who recently earned tenure in the Department of Electrical Engineering and Computer Science. "We don't have much time together in class (only two hours a week), but I hope in that time they get to see a little of the beauty of math -- because math allows you to see how and why everything works together. It really is a beautiful thing."